
    A resource-frugal probabilistic dictionary and applications in (meta)genomics

    The genomic and metagenomic fields, which generate huge sets of short genomic sequences, have brought their own share of high-performance computing problems. To extract relevant information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is considered too expensive a task, yet it is a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements, and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications built on top of it. (Comment: Submitted to PSC 201)
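
    The abstract does not spell the structure out; a common resource-frugal design of this kind couples a minimal perfect hash function (MPHF) mapping each indexed key to a dense slot with a short per-slot fingerprint that rejects most alien keys. A minimal sketch of that idea, with the MPHF stubbed by an ordinary dict and all names illustrative, not the paper's implementation:

    ```python
    # A fingerprint-based probabilistic dictionary sketch. Keys map to dense
    # slots 0..n-1 (here via a plain dict standing in for an MPHF); each slot
    # stores a small fingerprint so that queries for keys outside the build
    # set are rejected with probability about 1 - 2**-FBITS.
    import hashlib

    FBITS = 8  # fingerprint width: memory vs. false-positive trade-off

    def fingerprint(key: str) -> int:
        d = hashlib.blake2b(key.encode(), digest_size=2).digest()
        return int.from_bytes(d, "big") & ((1 << FBITS) - 1)

    class ProbabilisticDict:
        def __init__(self, keys, values):
            # Stand-in for an MPHF: any injective map keys -> 0..n-1 works.
            self._slot = {k: i for i, k in enumerate(keys)}
            self._fp = [fingerprint(k) for k in keys]
            self._val = list(values)

        def get(self, key, default=None):
            i = self._slot.get(key)
            if i is None:                      # a real MPHF returns *some* slot
                i = hash(key) % len(self._fp)  # even for alien keys; the
            if self._fp[i] != fingerprint(key):  # fingerprint catches most
                return default
            return self._val[i]

    kmers = ["ACGT", "CGTA", "GTAC"]
    d = ProbabilisticDict(kmers, [10, 20, 30])
    assert d.get("ACGT") == 10
    print(d.get("TTTT", "absent"))  # usually "absent"; collides with prob ~2**-8
    ```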

    Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching

    The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as the Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as Sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which uniformly cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, SuperSampler, implements this scheme, and experimental results on real bacterial collections closely match our theoretical findings. In comparison to Sourmash, SuperSampler achieves similar outcomes while using an order of magnitude less space and memory and running several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data.
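
    As a rough illustration of covering a fixed fraction of the k-mer space with a simple scheme, the sketch below keeps exactly those k-mers whose hash falls in the lowest fraction f of the hash range; the super-k-mer encoding is omitted, and none of the names below reflect SuperSampler's actual API:

    ```python
    # Fractional k-mer subsampling via hash thresholding: a k-mer is "covered"
    # iff its hash falls in the lowest fraction f of the hash range, so about
    # f of all distinct k-mers are kept, uniformly, and the same k-mer is
    # selected (or not) in every dataset, keeping sketches comparable.
    import hashlib, random

    def kmer_hash(kmer: str) -> float:
        d = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        return int.from_bytes(d, "big") / 2**64  # uniform in [0, 1)

    def sketch(seq: str, k: int = 21, f: float = 0.01) -> set:
        return {seq[i:i+k] for i in range(len(seq) - k + 1)
                if kmer_hash(seq[i:i+k]) < f}

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    random.seed(0)
    seq1 = "".join(random.choice("ACGT") for _ in range(5000))
    seq2 = seq1[:2500] + "".join(random.choice("ACGT") for _ in range(2500))
    # The Jaccard index of the two sketches estimates that of the full k-mer sets.
    print(f"estimated Jaccard: {jaccard(sketch(seq1, f=0.2), sketch(seq2, f=0.2)):.2f}")
    ```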

    Navigating in a sea of repeats in RNA-seq without drowning

    The main challenge in de novo assembly of NGS data is certainly dealing with repeats that are longer than the reads. This is particularly true for RNA-seq data, since coverage information cannot be used to flag repeated sequences, of which transposable elements are one of the main examples. Most transcriptome assemblers are based on de Bruijn graphs and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. The results of this work are twofold. First, we introduce a formal model for representing high copy number repeats in RNA-seq data and exploit its properties for inferring a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying in a de Bruijn graph a subgraph with this characteristic is NP-complete. Second, we show that in the specific case of a local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs. In particular, we designed and implemented an algorithm to efficiently identify AS events that are not included in repeated regions. Finally, we validate our results using synthetic data, and we give an indication of the usefulness of our method on real data.
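
    For readers unfamiliar with the underlying structure, a toy de Bruijn graph makes the repeat problem concrete: a repeat shared by distinct transcripts collapses onto a single path whose endpoints branch. A minimal sketch (k and the reads are illustrative):

    ```python
    # A minimal de Bruijn graph: nodes are (k-1)-mers, edges are k-mers seen
    # in reads. A repeat longer than k-1 shared by two reads collapses onto
    # one path, and the graph branches where the contexts diverge; these are
    # the structures the repeat-associated subgraph characterization targets.
    from collections import defaultdict

    def de_bruijn(reads, k=4):
        graph = defaultdict(set)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].add(kmer[1:])  # edge prefix -> suffix
        return graph

    reads = ["TACGTCCA", "GGCGTCTT"]  # both contain the repeat CGTC
    g = de_bruijn(reads, k=4)
    branching = {n: sorted(s) for n, s in g.items() if len(s) > 1}
    print(branching)  # {'GTC': ['TCC', 'TCT']} -- the repeat's exit node
    ```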

    Minimal perfect hash functions in large-scale bioinformatics

    The genomic and metagenomic fields, which generate huge sets of short genomic sequences, have brought their own share of high-performance computing problems. To extract relevant information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is considered too expensive a task, yet it is a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements, and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications built on top of it.
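
    As an illustration of the kind of minimal perfect hash function such an index can be built on, the sketch below follows the iterative collision-resolution idea used by BBHash-style constructions; it is a didactic toy, not the paper's implementation, and gamma and the level cap are illustrative:

    ```python
    # Iterative MPHF sketch: at each level, keys that land alone in their cell
    # are fixed and the cell's bit is set; colliding keys retry at the next
    # level. Every indexed key gets a distinct dense index in 0..n-1 (alien
    # keys get an arbitrary index, which is why fingerprints are added on top).
    import hashlib

    def h(key: str, level: int, mod: int) -> int:
        data = f"{level}:{key}".encode()
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big") % mod

    class MPHF:
        def __init__(self, keys, gamma=2.0, max_levels=64):
            self.bitmaps, self.offsets = [], []
            remaining, total = list(keys), 0
            for level in range(max_levels):
                if not remaining:
                    break
                m = max(1, int(gamma * len(remaining)))
                counts = [0] * m
                for k in remaining:
                    counts[h(k, level, m)] += 1
                bitmap = [c == 1 for c in counts]   # singleton cells are fixed
                self.bitmaps.append(bitmap)
                self.offsets.append(total)
                total += sum(bitmap)
                remaining = [k for k in remaining if not bitmap[h(k, level, m)]]
            assert not remaining, "raise max_levels or gamma"

        def lookup(self, key):
            for level, bitmap in enumerate(self.bitmaps):
                pos = h(key, level, len(bitmap))
                if bitmap[pos]:
                    # rank of pos among set bits (O(m) here; real
                    # implementations use constant-time rank structures)
                    return self.offsets[level] + sum(bitmap[:pos])
            return None  # only reachable for keys outside the build set

    mphf = MPHF([f"read{i}" for i in range(1000)])
    ids = {mphf.lookup(f"read{i}") for i in range(1000)}
    assert ids == set(range(1000))  # minimal and perfect on the build set
    ```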

    Playing hide and seek with repeats in local and global de novo transcriptome assembly of short RNA-seq reads

    Background: The main challenge in de novo genome assembly of DNA-seq data is certainly dealing with repeats that are longer than the reads. In de novo transcriptome assembly of RNA-seq reads, on the other hand, this problem has so far been underestimated. Even though repeated sequences are fewer and shorter in transcriptomes, they do create ambiguities and confuse assemblers if not addressed properly. Most transcriptome assemblers for short reads are based on de Bruijn graphs (DBG) and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. Results: The results of this work are threefold. First, we introduce a formal model for representing high copy-number and low-divergence repeats in RNA-seq data and exploit its properties to infer a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying such subgraphs in a DBG is NP-complete. Second, we show that in the specific case of local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs, and we present an efficient algorithm to enumerate AS events that are not included in repeats. Using simulated data, we show that this strategy is significantly more sensitive and precise than the previous version of KisSplice (Sacomoto et al. in WABI, pp 99–111, 1), Trinity (Grabherr et al. in Nat Biotechnol 29(7):644–652, 2), and Oases (Schulz et al. in Bioinformatics 28(8):1086–1092, 3) for the specific task of calling AS events. Third, we turn our focus to full-length transcriptome assembly and show that exploring the topology of DBGs can improve de novo transcriptome evaluation methods. Based on the observation that repeats create complicated regions in a DBG, and that assemblers trying to traverse these regions can infer erroneous transcripts, we propose a measure to flag transcripts traversing such troublesome regions, thereby giving a confidence level for each transcript. The originality of our work compared to other transcriptome evaluation methods is that we use only the topology of the DBG, and neither read nor coverage information. We show that our simple method gives better results than Rsem-Eval (Li et al. in Genome Biol 15(12):553, 4) and TransRate (Smith-Unna et al. in Genome Res 26(8):1134–1144, 5) on both real and simulated datasets for detecting chimeras, and is therefore able to capture assembly errors missed by these methods.
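
    As a sketch of such a topology-only confidence measure, the toy code below scores a transcript by the fraction of branching nodes its path traverses in the DBG. The scoring formula and parameters are illustrative, not the paper's exact measure:

    ```python
    # Topology-only transcript scoring sketch: build a DBG from reads, then
    # walk a candidate transcript through it and report the fraction of its
    # nodes that branch (in- or out-degree > 1). Transcripts threading through
    # dense, repeat-induced regions score higher and deserve less confidence.
    from collections import defaultdict

    def build_dbg(reads, k):
        out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
        for r in reads:
            for i in range(len(r) - k + 1):
                u, v = r[i:i+k-1], r[i+1:i+k]
                out_nbrs[u].add(v)
                in_nbrs[v].add(u)
        return out_nbrs, in_nbrs

    def branching_fraction(transcript, out_nbrs, in_nbrs, k):
        nodes = [transcript[i:i+k-1] for i in range(len(transcript) - k + 2)]
        branchy = sum(1 for n in nodes
                      if len(out_nbrs[n]) > 1 or len(in_nbrs[n]) > 1)
        return branchy / len(nodes)

    reads = ["TACGTCCA", "GGCGTCTT"]  # share the repeat CGTC
    out_nbrs, in_nbrs = build_dbg(reads, k=4)
    score = branching_fraction("TACGTCCA", out_nbrs, in_nbrs, k=4)
    print(f"fraction of branching nodes: {score:.2f}")  # higher = less trustworthy
    ```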

    Colib'read on Galaxy: a tools suite dedicated to biological information extraction from raw NGS reads

    Background: With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis pipelines for such data often begin with an assembly step that requires large amounts of computing resources and potentially removes or modifies parts of the biological information contained in the data. Our approach proposes to focus directly on biological questions, by considering raw unassembled NGS data, through a suite of six command-line tools. Findings: Dedicated to 'whole-genome assembly-free' treatments, the Colib'read tools suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read-set comparison. Based on the use of a de Bruijn graph and a Bloom filter, such analyses can be performed in a few hours, using small amounts of memory. Applications using real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tool dissemination, we developed Galaxy tools and Tool Shed repositories. Conclusions: With the Colib'read Galaxy tools suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach retains the maximum amount of biological information in the data and uses a very low memory footprint.
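
    As a sketch of the probabilistic membership structure mentioned above, the toy Bloom filter below indexes k-mers in a fixed bit array, with no false negatives and a tunable false-positive rate; the sizes and hash counts are illustrative, not the suite's actual parameters:

    ```python
    # Minimal Bloom filter: m bits and h hash functions per item. Membership
    # queries never miss an inserted item; a query for an absent item returns
    # True only if all h of its bit positions happen to be set, which is what
    # makes the memory footprint so small for k-mer indexing.
    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1 << 20, n_hashes=4):
            self.m, self.h = m_bits, n_hashes
            self.bits = bytearray(m_bits // 8)

        def _positions(self, item: str):
            for seed in range(self.h):
                d = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
                yield int.from_bytes(d, "big") % self.m

        def add(self, item: str):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    bf = BloomFilter()
    genome_kmers = ["ACGTA", "CGTAC", "GTACG"]
    for kmer in genome_kmers:
        bf.add(kmer)
    assert all(k in bf for k in genome_kmers)  # no false negatives
    print("TTTTT" in bf)                       # almost always False
    ```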

    From reads to transcripts : de novo methods for the analysis of transcriptome second and third generation sequencing

    The purpose of this thesis is to enable the processing of transcriptome sequencing data, i.e. messenger RNA sequences, which reflect gene expression. More precisely, the aim is to exploit the characteristics of the data produced by the new, so-called third-generation sequencing (TGS) technologies. These technologies produce long sequences that can cover the full length of RNA molecules. This has the advantage of avoiding the sequence assembly phase, a step that is a source of difficulties and errors but is necessary with the data generated by the previous generation of sequencing technologies (NGS). On the other hand, TGS data are noisy (up to 15% sequencing errors), requiring the development of new algorithms to analyze them. The core work of this thesis consisted in the methodological development and implementation of new algorithms for grouping TGS sequences by gene, then correcting them, and finally detecting the different isoforms of each gene.
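
    As a sketch of the first step described above, the toy code below groups long reads by shared k-mers with a union-find; the thesis's actual clustering, correction, and isoform-detection algorithms are more elaborate, and k and the sharing threshold here are illustrative:

    ```python
    # Grouping error-prone long reads by gene via shared k-mers: two reads
    # that share at least min_shared k-mers are merged into one cluster
    # (union-find). Sequencing errors break some k-mers, but reads from the
    # same gene still share enough exact k-mers to cluster together.
    def kmers(seq, k=13):
        return {seq[i:i+k] for i in range(len(seq) - k + 1)}

    def cluster_reads(reads, k=13, min_shared=3):
        parent = list(range(len(reads)))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x
        sets = [kmers(r, k) for r in reads]
        for i in range(len(reads)):
            for j in range(i + 1, len(reads)):
                if len(sets[i] & sets[j]) >= min_shared:
                    parent[find(i)] = find(j)
        groups = {}
        for i in range(len(reads)):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    reads = ["ACGTACGTACGTTTGA", "CGTACGTACGTTTGAC",  # overlapping: same gene
             "TTGCAGGCAGGACCTA"]                       # unrelated gene
    print(cluster_reads(reads))  # [[0, 1], [2]]
    ```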

    From reads to transcripts: de novo methods for the analysis of second- and third-generation transcriptome sequencing.

    The purpose of this thesis is to enable the processing of transcriptome sequencing data, i.e. messenger RNA sequences, which reflect gene expression. More precisely, the aim is to exploit the characteristics of the data produced by the new, so-called third-generation sequencing (TGS) technologies. These technologies produce long sequences that can cover the full length of RNA molecules. This has the advantage of avoiding the sequence assembly phase, a step that is a source of difficulties and errors but is necessary with the data generated by the previous generation of sequencing technologies (NGS). On the other hand, TGS data are noisy (up to 15% sequencing errors), requiring the development of new algorithms to analyze them. The core work of this thesis consisted in the methodological development and implementation of new algorithms for grouping TGS sequences by gene, then correcting them, and finally detecting the different isoforms of each gene.